Adding --direct-io flag for model loading #18166
Conversation
|
I think you need this:

```diff
diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 1355eea95..2db2115a0 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -918,8 +918,7 @@ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
         GGML_ASSERT(cur->data != nullptr);
         GGML_ASSERT(w.idx < files.size());
         const auto & file = files.at(w.idx);
-        file->seek(w.offs, SEEK_SET);
-        file->read_raw(cur->data, ggml_nbytes(cur));
+        file->read_raw_at(cur->data, ggml_nbytes(cur), w.offs);
     }

     if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {
```

Probably need to assert that …
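For context, a positional read helper like the `read_raw_at()` suggested in the diff above could look roughly like this. This is a hypothetical sketch assuming a raw POSIX file descriptor, not the actual llama.cpp implementation (`llama_file` wraps more than an fd): `pread()` reads at an explicit offset without moving the file position, so no separate `seek()` call is needed and concurrent readers do not race on a shared offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <stdexcept>
#include <unistd.h>

// Read exactly `len` bytes at absolute offset `offs` without touching the
// file position. Loops because pread() may return a short count.
static void read_raw_at(int fd, void * dst, size_t len, off_t offs) {
    char * out = static_cast<char *>(dst);
    while (len > 0) {
        const ssize_t n = pread(fd, out, len, offs);
        if (n <= 0) {
            throw std::runtime_error("pread failed or hit EOF");
        }
        out  += n;
        offs += n;
        len  -= static_cast<size_t>(n);
    }
}
```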
|
Thanks for the hint, changed that. I think an assert would not work as …
Ok. Would we even need to have … Also some suggestions that I have not tested, but should at least convey what I mean:

```diff
diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 2db2115a0..ae0c698be 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -508,8 +508,11 @@ llama_model_loader::llama_model_loader(
         files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
         contexts.emplace_back(ctx);

-        // Disable mmap in case Direct I/O is enabled and available
-        if (use_direct_io && files.at(0)->has_direct_io()) {
+        // check if direct io is enabled and supported
+        use_direct_io = use_direct_io && files.back()->has_direct_io();
+
+        if (use_direct_io && use_mmap) {
+            LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
             use_mmap = false;
         }
@@ -581,6 +584,10 @@ llama_model_loader::llama_model_loader(
             files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
             contexts.emplace_back(ctx);

+            if (use_direct_io && !files.back()->has_direct_io()) {
+                throw std::runtime_error(format("unexpected: direct I/O is not supported for split file %s", fname_split));
+            }
+
             // Save tensors data offset info of the shard.
             for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                 std::string tensor_name = std::string(cur->name);
@@ -722,6 +729,7 @@ llama_model_loader::llama_model_loader(
     }

     this->use_mmap = use_mmap;
+    this->use_direct_io = use_direct_io;
     this->check_tensors = check_tensors;
     this->no_alloc = no_alloc;
 }
diff --git a/src/llama-model-loader.h b/src/llama-model-loader.h
index de06b5283..6f15115ce 100644
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -70,6 +70,7 @@ struct llama_model_loader {
     size_t n_bytes = 0;

     bool use_mmap = false;
+    bool use_direct_io = false;
     bool check_tensors;
     bool no_alloc;
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index cf0c39475..502859d2e 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2337,7 +2337,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
     const bool use_mmap_buffer = true;

-    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
+    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
+            __func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");

     // build a list of buffer types for the CPU and GPU devices
     pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);
```
|
Just an FYI: #18012 broke loading with mmap disabled on Windows.
|
@askmyteapot Thank you for the hint. The issue was that I used …
|
@ggerganov Should I isolate the changes with the Windows fix in a new PR? |
|
Yes, I would like to take an extra look at the changes here, so better to fix the Windows issue in the meantime. Thanks
NeoZhangJianyu
left a comment
What are the parameters to load the model file in this PR?
Here is my understanding, please correct me if it's wrong:
--no-mmap -ndio
--no-mmap -dio
--mmap
When must the user use -ndio?
I think the parameters are a little complex.
Here is my suggestion:
We should keep --mmap and --no-mmap.
In the case of --no-mmap, the code should detect and switch to dio or ndio smartly on Windows/Linux/Mac.
|
IIUC, the implementation in this PR currently requires passing … In other words, …
|
I agree with @ehoogeveen-medweb, now explicitly specifying … Now there are three ways: …
|
|
@JTischbein Here are my test cases and results: …
Why does "-dio" reduce the performance? Could you confirm it on any Intel Core CPU (12th gen or newer) with an iGPU (Windows/Linux)? Here is the test cmd: … Thank you!
|
@NeoZhangJianyu I have found the issue behind the performance drop, and probably why it is working on some devices with the SYCL and Vulkan backends. The current code uses an additional CPU buffer, from which the tensors are then copied into the pinned memory buffers (making it slower depending on the …
|
I don't fully understand this theory. These are UMA systems; are you suggesting that these pages are in a carve-out region or something similar that makes them inaccessible for DMA? I don't think we have a way to detect that in the backend. Something like #18317 (comment) is the best we could try, but that didn't work. Is it possible to handle this error in the model loader and fall back to the non-direct-I/O path? That seems like it would be a more robust solution.
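The fallback idea could be sketched like this. This is a hypothetical example, assuming a POSIX fd opened with `O_DIRECT` plus a plain `FILE *` for the buffered path; the real loader's types and error handling will differ. When the kernel rejects the destination buffer, `pread()` typically fails with `EFAULT` (or `EINVAL` for alignment problems), and the read can then be retried through the page cache, which has no alignment or DMA requirements.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <stdexcept>
#include <unistd.h>

// Returns true if the direct path worked, false if it fell back to fread().
static bool read_with_fallback(int direct_fd, FILE * buffered, void * dst, size_t len, long offs) {
    const ssize_t n = pread(direct_fd, dst, len, static_cast<off_t>(offs));
    if (n == static_cast<ssize_t>(len)) {
        return true; // direct read succeeded
    }
    if (n >= 0 || (errno != EFAULT && errno != EINVAL)) {
        // short read, or a failure unrelated to the destination buffer
        throw std::runtime_error("read failed for a reason other than a bad buffer");
    }
    // Retry through the page-cache path, which accepts any pointer/offset.
    if (fseek(buffered, offs, SEEK_SET) != 0 || fread(dst, 1, len, buffered) != len) {
        throw std::runtime_error("buffered fallback failed");
    }
    return false;
}
```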
|
Another solution might be to allocate these buffers with (aligned) malloc and use the buffer_from_host_ptr function to make them GPU-accessible. This isn't currently implemented in ggml-vulkan, but it's probably not too difficult to do so.
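For reference, an aligned host allocation of the kind described above could be obtained with `posix_memalign()`. This is a generic sketch: the 4096-byte default is an assumption (the real requirement is the device's logical block size, often 512 or 4096 bytes), and the subsequent registration via something like `buffer_from_host_ptr` is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <new>

// O_DIRECT requires the destination pointer (and usually length and offset)
// to be block-aligned, so allocate the target buffer with explicit alignment.
static void * alloc_direct_io_buffer(size_t size, size_t alignment = 4096) {
    void * ptr = nullptr;
    if (posix_memalign(&ptr, alignment, size) != 0) {
        throw std::bad_alloc();
    }
    return ptr; // caller releases with free()
}
```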
|
@jeffbolznv Yes, that is currently my guess. Multiple people have now reported that this PR fixed the reading issues, and the difference between master and this PR is that the buffer pointer given to read points to a …
In case that can be tested without too much effort, it would be great to know whether …
@JTischbein Thank you! |
|
I tested with an old version: 93bb92664e37086cb4adab24f4911f7681724fce. --mmap is preferred first in the SYCL backend. For the SYCL backend, I am satisfied with the model-load performance of the old version. I only hope that model loading passes and performance is not reduced. Thank you!
|
Hello, thanks for this PR. Has anyone tested how this affects offloaded MoE models where the experts are memory-mapped from disk? Does this PR provide any speedup in this case?
# Conflicts:
#	src/llama-mmap.cpp
|
After the merge, loading with (mmap = false, direct_io = false) is slower than the previous build of this PR that I have. Also, -dio and -ndio are not implemented in llama-bench, so I cannot run the benchmark to show how much slower it is after the new merge. If you need any further logs from my side, do let me know.
|
@engrtipusultan Thank you for testing! Which backend are you using? And do you have some more logs? There should be a line like …
|
I am using Vulkan. I created #18317 earlier. My setup:
Kernel: 6.14.0-37-generic
OS: Ubuntu 24.04
CPU: AMD Ryzen 7 5825U (Vega 8 APU), 64 GB of RAM
Using BIOS UMA plus the GTT/TTM module to make the memory accessible.
Logs:
|
|
After the latest commit, things appear to be back to normal for me with mmap off and direct I/O on (mmap = false, direct_io = true). Thank you very much. With (mmap = false, direct_io = false), loading of the model and inference are both slow, which is fine by me since I can use the combination above. But if something needs to be checked, do let me know.
Co-authored-by: Georgi Gerganov <[email protected]>
|
llama-bench does not have a -dio or -ndio switch. By default it is now showing …
Do we have a case in which …
|
@ggerganov So far … With …
* Adding --direct-io flag for model loading
* Fixing read_raw() calls
* Fixing Windows read_raw_at
* Changing type off_t to size_t for Windows and renaming functions
* Disable direct io when mmap is explicitly enabled
* Use read_raw_unsafe when upload_backend is available, not functional on some devices with Vulkan and SYCL
* Fallback to std::fread in case O_DIRECT fails due to bad address
* Windows: remove const keywords and unused functions
* Update src/llama-mmap.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: jtischbein <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
|
Just recompiled today and found this is causing a big performance regression for NUMA on my dual …

Sorry I can't really help much more, as I'm at a holiday home for the next 3 weeks and have 4 young kittens who refuse to let me near the PC! 😼 If it's any help, here is my script:

```bash
#!/bin/bash
host_address=192.168.1.1
port_number=8080
# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null
# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
echo "Dropping caches..."
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi
# Run the main command
CUDA_VISIBLE_DEVICES=0 GGML_OP_OFFLOAD_MIN_BATCH=2560 ~/llama.cpp/build/bin/llama-server \
--host "$host_address" \
--port "$port_number" \
--alias "local/kimi-k2-0905" \
--jinja \
--model ~/models/gguf/Kimi-K2-Instruct-0905-Q4_X.gguf \
--n-gpu-layers 99 \
--no-direct-io \
--numa distribute \
--threads "$(nproc)" \
--override-tensor exps=CPU \
--flash-attn 1 \
--parallel 1 \
--ctx_size 131072 \
--cache-type-k f16 \
--cache-type-v q4_0 \
--batch-size 14336 \
--ubatch-size 14336 \
--cache-ram 65536 \
--temp 0.6 \
    --min-p 0.01
```
|
Weirdly, it may actually be giving me (a lot) better PP due to improved … I will try remotely logging in using a tablet and Bluetooth keyboard tomorrow and see if I can run some proper experiments to figure it out... I suspect that …
|
It seems to be the NUMA "first-touch" policy causing the problems (the instantaneous drop-caches was a red herring, I think, just because the data is not getting stored in the usual RAM buffers). It might be worth auto-disabling this when using NUMA, unless some better way can be found to use at least one thread per NUMA node to do the loading?
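The first-touch behavior described above can be illustrated with a toy parallel loader. Under Linux's default NUMA policy, a page is placed on the node of the thread that first writes it, so if a single loader thread reads the whole model, every page lands on that thread's node; touching each chunk from a different (node-pinned) thread spreads the pages instead. This is only an illustration of the concept, not llama.cpp code, and it omits the actual pinning (e.g. via libnuma):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// First-touch each chunk of `buf` from a separate thread. With the threads
// pinned to different NUMA nodes, the backing pages end up distributed
// across nodes instead of concentrated on the loading thread's node.
static void first_touch_parallel(char * buf, size_t total, unsigned n_threads) {
    std::vector<std::thread> workers;
    const size_t chunk = (total + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const size_t beg = t * chunk;
        const size_t end = beg + chunk < total ? beg + chunk : total;
        if (beg >= end) break;
        // The write is what triggers page placement ("first touch").
        workers.emplace_back([buf, beg, end] { memset(buf + beg, 0, end - beg); });
    }
    for (auto & w : workers) w.join();
}
```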
Yeah, this doesn't work at all with NUMA, as it's putting all the tensors on whatever node the thread is running on and then having to pass everything through the QPI interconnect (around 80 GB/s for my …). So I went back to using a single NUMA node only, and these are the results for …
At first these seemed disappointing, but whatever this PR does to lay the tensors out in RAM means that I can now get way better offload PP to the …
and this now brought the break-even batch size for … Not had a chance to run it yet, but the break-even batch size for …
Compared to …
Follow-up for PR #18012 (comment).
To enable direct I/O model reading by default on Linux and Windows, but to stay with --mmap as the default on Mac, this PR adds an additional flag for enabling/disabling direct I/O. This flag is true by default and overrules the mmap parameter. If --direct-io is true and direct I/O is available, --mmap gets disabled. And if --no-direct-io is set or direct I/O is not available (e.g. on Mac), the specified mmap value is used.
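The precedence described above can be sketched as a small pure function (names are illustrative, not the actual llama.cpp API):

```cpp
struct io_mode {
    bool use_direct_io;
    bool use_mmap;
};

// want_direct_io defaults to true (--direct-io); want_mmap is the requested
// mmap setting; direct_io_supported is the platform/filesystem probe result.
static io_mode resolve_io_mode(bool want_direct_io, bool want_mmap, bool direct_io_supported) {
    io_mode m;
    m.use_direct_io = want_direct_io && direct_io_supported;
    // Direct I/O and mmap are mutually exclusive; direct I/O takes priority,
    // otherwise the requested mmap value is kept (e.g. on Mac).
    m.use_mmap = want_mmap && !m.use_direct_io;
    return m;
}
```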